Getting Started¶
Spyglass is a ready to use app that you can use to monitor new content across several predefined websites.
After configuring spyglass, you may run several instances of crawlie that will connect to your application’s web server through a RESTful API provided by django-tastypie.
This tutorial assumes you have a basic understanding of Django.
News Spyglass¶
Throughout this tutorial we will be referencing an example configuration called News Spyglass.
It’s a news monitoring service that receives queries from users containing:
- Search Terms String
- A Site name from a selection box
- A Persistent choice box that when selected the query won’t be completed after one notification.
Additionally, News Spyglass has one model, which will be the one populated by spyglass:
class NewsStory(models.Model):
headline = models.CharField(max_length=200, blank=False)
category = models.CharField(max_length=200, blank=False)
subtitle = models.CharField(max_length=200, blank=False)
def __unicode__(self):
return self.headline
class Meta:
verbose_name_plural = "newsstories"
The source code for News Spyglass can be found at the example folder of the github repo.
Installation¶
first off, install the requirements. They can also be found in the pip-friendly spyglass/requirements.txt:
- Python 2.7+
- Django 1.4.3
- django-tastypie 0.9.11+
You can install them all with pip install -r requirements.txt
Configuration¶
Add
spyglass
toINSTALLED_APPS
in yoursettings.py
.Add
METAMODEL='<path-to-your-model>'
insettings.py
for the metamodel. In our example, it’sMETAMODEL=core.models.NewsStory
The metamodel is a model of your application that will be populated when new results are found by the crawlers. All of its fields will later have to correspond to XPaths on each given Site.
- In your project’s urls.py, add
import spyglass.urls
and url(r'^', include(spyglass.urls))
to add the spyglass urls to the root level.You can include spyglass’ urls at any level but you’ll have to edit the crawler’s serverconf appropriately.
- In your project’s urls.py, add
Sync the database with
./manage.py syncdb
Add the sites to be crawled to the spyglass.Site model usign django’s admin panel, direct database access or fixtures. For example:
[ { "model": "spyglass.Site", "pk": 1, "fields": { "name":"Example", "url":"http://example.com", "poll_time": 5 } }, { "model": "spyglass.Site", "pk": 2, "fields": { "name":"Example2", "url":"http://example2.com", "poll_time": 0 } }, ] The ``poll_time`` field indicates a time interval (in minutes) between each visit by any crawler. Sites with poll_time set to 0 will be crawled in a round-robin fashion.
For each Site, add a DataField entry for each field of the metamodel, and a corresponding XPath. For example:
[ { "model": "spyglass.DataField", "pk": 1, "fields": { "site": 1, "field_name": "Headline", "xpath": "/html/body/div[4]/div/div/div[2]/div/div/h2/a" } }, { "model": "spyglass.DataField", "pk": 2, "fields": { "site": 1, "field_name": "Category", "xpath": "/html/body/div[4]/div/div/div[2]/div/div/div/a" } }, { "model": "spyglass.DataField", "pk": 3, "fields": { "site": 1, "field_name": "Subtitle", "xpath": "/html/body/div[4]/div/div/div[2]/div/div/div/p" } } ]
Currently, there is no way to exclude fields of the metamodel from this process.